(57) ABSTRACT

Disclosed is a method and system for collecting profile 
information about users accessing dynamically generated 
content from one or more servers. In a specific embodiment, 
a server dynamically generates a web page in response to a 
user request. The server customizes the web page content 
based on the requested universal resource identifier (URI) 
and one or more of: the user's identity, access permissions, 
demographic information, and previous behavior at the site. 
The web server then passes the URI, user identity, and 
dynamically generated web page to an access information 
collector. The access information collector generates docu- 
ment comparators from the current web page content and 
compares them to document comparators associated with 
previously retrieved web pages. If the current web page is 
sufficiently similar to some previously retrieved web page, 
the access information collector logs the URI, user identity, 
and a document key associated with the matching previously 
retrieved page. Otherwise, the access information collector 
generates a new key; stores the new key and the document 
comparators in a database; and logs the URI, user identity, 
and the newly generated document key. 

27 Claims, 4 Drawing Sheets 




05/03/2004, EAST Version: 1.4.1 







U.S. Patent    Feb. 6, 2001    Sheet 2 of 4    US 6,185,614 B1



[FIG. 2: block diagram showing Clients 200, 201, and 202, Web Servers, Static Content Databases, and Dynamic Content Generators]






[FIG. 4: flowchart of the Access Information Collector. Blocks:
400 Receive URI, request time, client identity, and document content;
402 Compute Document Comparator;
404 Select Candidate Document and Comparator;
410 Retrieve Document Key for Candidate Document;
415 Add entry to Log File with Candidate Document's Key;
420 Generate new Key for retrieved document;
425 Add new entry to Retrieved Document Database;
430 Add new entry to Document Comparator Index Database;
435 Add entry to Log File with retrieved Document's Key;
490 end.]




METHOD AND SYSTEM FOR COLLECTING 
USER PROFILE INFORMATION OVER THE 
WORLD-WIDE WEB IN THE PRESENCE OF 
DYNAMIC CONTENT USING DOCUMENT 
COMPARATORS 

FIELD OF THE INVENTION 

This invention relates in general to computer software, 
and in particular to a method and system for collecting 
profile information about users accessing Web pages from a 
plurality of Web servers. More particularly, the present 
invention relates to a method and system by which user 
profile information can be collected when the Web content 
is generated dynamically for each request at the Web server. 

BACKGROUND OF THE INVENTION 

In the World-Wide Web, a content provider deploys a 
plurality of Web servers that deliver Web pages to clients. 
When requesting a Web page, the client supplies a Uniform 
Resource Locator (URL) or Universal Resource Identifier 
(URI) to the server. The server associates this URI with a 
particular page of content and delivers that information to 
the requesting client. 

As the World-Wide Web is being used increasingly to 
support commerce and targeted advertising, content provid- 
ers desire to collect information about which users are 
accessing the site and what site content those users are 
accessing. This information can be used to establish "pro- 
files" for each site visitor and enable tuning of the Web site 
content to meet the visitors' interests. Traditionally, this
visitor information is collected by the Web server or a proxy 
server in the form of a log file. This log file contains, among 
other things, the requesting host address, the requested URI, 
and the time at which the request was received. Because 
each URI represents a particular piece of static content at the 
Web site, the URI is sufficient for a user profile analyzer to 
evaluate which content was received by each user and to 
detect similarities among the behavior of different users. 

Recent Web servers are providing support for server-side 
scripting, whereby the URI is associated with a program or 
script that is executed at the Web server. This script is 
responsible for receiving the URI and the user identity and 
using this information to dynamically generate the content 
that should be returned to the requesting user. This generated 
content may account for the user's previous behavior at the 
site, his access permissions, his demographic information, or 
any number of other factors. Dynamic server content is 
supported by most Web servers today, including Microsoft's 
Active Server Pages, Sun's Dynamic Server Pages, industry- 
standard servlets, Common Gateway Interface (CGI) 
executables, and other mechanisms. 

As a result of this direction, a particular URI can no longer 
be associated with particular content at the Web site. On 
different requests, the URI may return wholly different 
content depending on the requesting user and the context in 
which the request was issued. Consequently, existing meth- 
ods for capturing user information are insufficient for pro- 
ducing meaningful user profiles. More specifically, the reli- 
ance on URIs alone prevents the accurate characterization of 
which users are exhibiting similar access behavior. 
Therefore, a method is needed for efficiently collecting user 
access information in the presence of dynamically-generated
content at a Web server, in order to support the accurate 
generation of user profiles. 

SUMMARY OF THE INVENTION

One object of the present invention is to provide, within 
a networked environment, a method of associating each 
user's request for World-Wide Web information to the
content of the retrieved document when that document was
generated dynamically. 

Another object of the present invention is to group 
together user requests that retrieve the same document 
content. Yet another object of the present invention is to 
ignore minor variations in document content as might occur 
when the documents differ only in the presence of the 
requesting user's name. Still yet another object of the 
present invention is to enable the use of a range of metrics
for comparing two documents for similarity.

To achieve the foregoing objects and in accordance with 
the purpose of the invention as broadly described herein, a 
method and system are disclosed for collecting information
about user accesses by analyzing the content of retrieved
documents and associating Document Comparators with 
each document. These and other features, aspects, and 
advantages of the present invention will become better 
understood with reference to the following description,
appended claims, and accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS 

For a more complete understanding of the present invention
and for further advantages thereof, reference is now
made to the following Detailed Description taken in con- 
junction with the accompanying Drawings, in which: 

FIG. 1 is a pictorial representation of a data processing 
system which may be utilized to implement a method and 
system of the present invention;

FIG. 2 shows a block diagram of a World-Wide Web 
environment in which user access information may be 
generated in accordance with the present invention; 

FIG. 3 shows a sample data structure for representing the
information collected by the Access Information Collector in
accordance with the present invention; and

FIG. 4 is a flowchart showing how an Access Information
Collector analyzes a document retrieved from a Web server
and updates its data structures.

DETAILED DESCRIPTION OF THE 
INVENTION 

Referring to FIG. 1, there is depicted a graphical representation
of a data processing system 8, which may be
utilized to implement the present invention. As may be seen,
data processing system 8 may include a plurality of
networks, such as Local Area Networks (LAN) 10 and 32,
each of which preferably includes a plurality of individual
computers 12 and 30, respectively. Of course, those skilled
in the art will appreciate that a plurality of Intelligent Work
Stations (IWS) coupled to a host processor may be utilized
for each such network. Each said network may also consist
of a plurality of processors coupled via a communications
medium, such as shared memory, shared storage, or an
interconnection network. As is common in such data processing
systems, each individual computer may be coupled
to a storage device 14 and/or a printer/output device 16 and
may be provided with a pointing device such as a mouse 17.

The data processing system 8 may also include multiple
mainframe computers, such as mainframe computer 18,
which may be preferably coupled to LAN 10 by means of
communications link 22. The mainframe computer 18 may
also be coupled to a storage device 20 which may serve as
remote storage for LAN 10. Similarly, LAN 10 may be
coupled via communications link 24 through a sub-system
control unit/communications controller 26 and communications
link 34 to a gateway server 28. The gateway server 28
is preferably an IWS which serves to link LAN 32 to LAN 10.

With respect to LAN 32 and LAN 10, a plurality of
documents or resource objects may be stored within storage
device 20 and controlled by mainframe computer 18, as
resource manager or library service for the resource objects
thus stored. Of course, those skilled in the art will appreciate
that mainframe computer 18 may be located a great geographic
distance from LAN 10 and similarly, LAN 10 may
be located a substantial distance from LAN 32. For example,
LAN 32 may be located in California while LAN 10 may be
located within North Carolina and mainframe computer 18
may be located in New York.

Software program code which employs the present invention
is typically stored in the memory of a storage device 14
of a stand alone workstation or LAN server from which a
developer may access the code for distribution purposes. The
software program code may be embodied on any of a variety
of known media for use with a data processing system such
as a diskette or CD-ROM or may be distributed to users from
a memory of one computer system over a network of some
type to other computer systems for use by users of such other
systems. Such techniques and methods for embodying software
code on media and/or distributing software code are
well-known and will not be further discussed herein.

Referring now to FIG. 2, components of a World-Wide
Web system are shown in which user information may be
gathered in accordance with the present invention. A plurality
of clients (generally indicated by reference numerals
200, 201, and 202) access information over a network 205
using World-Wide Web browsers such as NETSCAPE
NAVIGATOR, a trademark of Netscape, Inc. or
MICROSOFT INTERNET EXPLORER, a trademark of
Microsoft, Inc. These clients access a plurality of Web
servers (generally indicated by reference numerals 210, 211,
and 212) such as LOTUS GO, a trademark of Lotus, Inc.,
MICROSOFT INTERNET INFORMATION SERVICE
(IIS), a trademark of Microsoft, Inc. or NETSCAPE
FASTTRACK, a trademark of Netscape, Inc.

In accessing these Web servers, the clients 200, 201 and
202 specify a URI. Each of these Web servers 210, 211, and
212 accesses a Static Content Database (generally indicated
by reference numerals 220, 221, and 222) and a Dynamic
Content Generator (generally indicated by reference numerals
230, 231, and 232) that receives a URI and other
information about the user and generates Web content suitable
for display by the browsers at the clients 200, 201, and
202. These Dynamic Content Generators 230, 231, and 232
may take many forms, including Active Server Pages,
servlets, Common Gateway Interface (CGI) binaries, or
Dynamic Server Pages.

Upon receiving a URI request from a client, the Web
server 210, 211, or 212 either retrieves the content from the
Static Content Database 220, 221, or 222 or from the
Dynamic Content Generator 230, 231, or 232. An Access
Information Collector 240 receives client requests and the
content retrieved from the Static Content Database 220, 221, or
222 or from the Dynamic Content Generator 230, 231, or
232 and collects log information that can be used to analyze
the access patterns of various users. It should be understood
that the physical location of the components shown in FIG.
2 may vary. In particular, the Access Information Collector
240 may be embedded in the Web servers 210, 211, and 212.
The Dynamic Content Generators 230, 231, and
232 and the Static Content Databases 220, 221, and 222 may be
co-located with the Web servers 210, 211, and 212.

FIG. 3 illustrates the information collected by the Access
Information Collector in accordance with the present invention.
A Log File 300 contains a sequence of Access Records.
Each Access Record includes at least a time stamp 301, a
requested URI 313, and a Document Key 312.

A Retrieved Document Database 310 contains a repository
of Document Records corresponding to documents
retrieved by users. Each Document Record 311 is indexed by
a Document Key 312 and contains an associated URI 313,
document text 314, and a Document Comparator 315. The
Document Key 312, when combined with the URI 313,
serves to uniquely identify the Document Record 311. Document
Keys may be assigned sequentially or by any other
appropriate method.

The Document Comparator 315 is a representation of the
document's contents and is used by a Document Comparator
Function to determine whether there are substantial predefined
similarities, as will be subsequently described in
greater detail, between the current document and other
previously retrieved documents. The Document Comparator
Function receives the Document Comparators for two documents
and determines whether the two documents are substantially
similar. To make this determination, the Function
may employ a Document Difference Threshold, a numeric
value that indicates how much two documents may differ
before they are no longer deemed to be substantially similar.
The use of the Document Difference Threshold depends on
the particular Document Comparator Function being used.
The use of a Document Difference Threshold allows the
Document Comparator Function to ignore minor differences
between two documents. Such minor differences include
timestamps, client name, or client-specific data.

In the present embodiment of this invention, the Document
Comparator 315 is the actual content of the document
itself, and the Document Comparator Function for any two
documents is defined to be the number of character
insertions, deletions, or modifications required to convert
one document to the other. This computation is well understood
in the prior art (see, for example, the use of tries, as
described in Chapter 11 of Alan Tharp, File Organization
and Processing, Wiley, 1988) and will not be discussed
further. Alternative embodiments of this invention may
compute a Document Comparator 315 by mapping each
word, paragraph, or section of the document to a binary
token. In this case, the Document Comparator Function
might count the number of matching binary tokens, and the
Document Difference Threshold would designate what percentage
of the tokens must match (see, for example, "Copy
Detection Mechanisms for Digital Documents," by Sergey
Brin, James Davis, and Hector Garcia-Molina, in Proceedings
of the 1995 SIGMOD International Conference on
Management of Data, pages 398-409, May 1995). Yet
another embodiment of this invention may define a Document
Comparator 315 as a list of the most significant (as
predefined) words or phrases in the document; the Document
Comparator Function may simply count how many
words or phrases occur in both documents, and the Document
Difference Threshold would designate what percentage
of words in each document must appear in the other. Other
comparison methods are well established in the prior art.
The essential element of a Document Comparator 315 is that
a metric (i.e. the Document Comparator Function) must
exist for comparing two different Document Comparators to
determine by how much their respective documents differ.
Indeed, a Document Comparator 315 may actually comprise
multiple Comparators, one per each predefined section of
the document, each having an associated Document Comparator
Function.

Finally, a Document Comparator Index 320 associates
each Document Comparator 315 with the corresponding
Document Key 312. The Index 320 is used to improve the
performance of the Document Comparator 315 evaluations
and the selection of Candidate Documents (see FIG. 4).
However, it is a performance optimization that may be
omitted by alternative embodiments of this invention.

Though the data structures have been illustrated in FIG. 3
with a particular embodiment, alternative representations of
this information are possible. The essential attributes of
these implementations are the association of each Document
Comparator 315 to a Document Key 312, the association of
each user URI 313 retrieval with a particular Document Key
312, and the association of each Document Key 312 with
particular document content. It should be noted that various
optimizations are also possible. For example, instead of
storing each document's full content, the Retrieved Document
Database 310 may store only a list of most significant
words or phrases.

When a document is accessed from the Web server (with
a particular URI), the Access Information Collector 240
analyzes the retrieved document (using the Document Comparator
Function) to determine whether it is substantially
similar to another document that has been previously
retrieved from that Web server using the same URI. If a
substantially similar document has already been generated
by the Web server, then the user's access is associated with
that previous document; however, if a substantially similar
document has not been previously generated by the Web
server, then the user's access is associated with this new
document. In this way, the Access Information Collector 240
distinguishes between different dynamically-generated
documents retrieved using the same URI while also merging
access information about documents that are nearly identical.

Referring now to FIG. 4, a flowchart depicts the steps
taken by the Access Information Collector 240 to analyze a
document retrieved from a Web server and to update the Log
File 300, Retrieved Document Database 310, and Document
Comparator Index 320 (as shown in FIG. 3). At block 400,
the Access Information Collector 240 receives the requested
URI, the time of the request, the identity of the requesting
client, and the content of the retrieved document. At block
402, a Document Comparator 315 is computed for the
retrieved document. At block 404, a Candidate Document
and Candidate Document Comparator are selected from the
Retrieved Document Database 310. The Candidate Document
is a document in the Retrieved Document Database
310 whose URI matches that of the retrieved document. (It
should be understood that alternative embodiments of this
invention may remove the restriction that the URI of the
retrieved document and the URI of the Candidate Document
match. Alternative embodiments of this invention may also
introduce additional restrictions on what constitutes a Candidate
Document.) At decision block 406, it is determined
whether or not a Candidate Document has been found. If the
answer to decision block 406 is yes, then at decision block
408, the Document Comparator Function is invoked with the
Document Comparators of the retrieved document and of the
Candidate Document to determine whether or not the
retrieved document and the Candidate Document are substantially
similar.

Continuing with FIG. 4, if the answer to decision block
408 is yes, then it is determined that the retrieved document
is sufficiently similar to the Candidate Document and no new
entry is required to either the Retrieved Document Database
310 or to the Document Comparator Index 320. At block
410, the Document Key is retrieved for the Candidate
Document. At block 415, a new entry is added to the Log
File, including the time stamp, requested URI, and candidate
document's Document Key. The process then terminates at
block 490. If the answer to decision block 408 is no, then
control returns to block 404, where another Candidate
Document is selected for evaluation.

If the answer to decision block 406 is no, then it is
determined that the retrieved document is new. At block 420,
a new Document Key is generated for the retrieved document.
At block 425, a new entry is added to the Retrieved
Document Database 310 to associate the retrieved document's
Document Key with a new Document Record containing
the retrieved URI, retrieved document, and retrieved
document's Document Comparator. At block 430, a new
entry is added to the Document Comparator Index 320
database to associate the retrieved document's Document
Comparator with the retrieved document's Document Key.
At block 435, a new entry is added to the Log File, including
the time stamp, requested URI, and retrieved document's
Document Key. The process then terminates at block 490.

Thus, each user access is associated with a Document Key
representing a document in the Retrieved Document Database
with a sufficiently close Document Comparator. Each
URI is, therefore, potentially linked with multiple
documents, each having different content. At the same time,
the analysis ignores minor differences between documents,
as might arise when page content is customized in minor
ways to reflect the identity of the requesting user.
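
The FIG. 4 flow may be easier to follow as a short sketch. The Python below is illustrative only and is not taken from the patent: the class and function names, the in-memory dictionary standing in for the Retrieved Document Database 310, the tuple layout of each log entry, and the default threshold of 20 are all assumptions. It follows the present embodiment in using the document content itself as the Document Comparator and a character edit distance as the Document Comparator Function. The Document Comparator Index 320 is left out, since the description treats it as an optional optimization.

```python
from dataclasses import dataclass, field
from itertools import count


def edit_distance(a: str, b: str) -> int:
    """Character insertions, deletions, or modifications needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # modify (or keep) the character
        prev = curr
    return prev[-1]


@dataclass
class DocumentRecord:
    key: int
    uri: str
    text: str
    comparator: str  # assumption: the comparator is the content itself


@dataclass
class AccessInformationCollector:
    threshold: int = 20                             # hypothetical Document Difference Threshold
    documents: dict = field(default_factory=dict)   # stands in for Retrieved Document Database 310
    log: list = field(default_factory=list)         # stands in for Log File 300
    _keys: count = field(default_factory=lambda: count(1))  # sequential Document Keys

    def record_access(self, uri, request_time, client, content):
        comparator = content                                  # block 402
        for candidate in self.documents.values():             # block 404
            if candidate.uri != uri:                          # candidates must share the URI
                continue
            distance = edit_distance(comparator, candidate.comparator)
            if distance <= self.threshold:                    # decision block 408
                key = candidate.key                           # block 410
                break
        else:                                                 # no sufficient candidate (406: no)
            key = next(self._keys)                            # block 420
            self.documents[key] = DocumentRecord(key, uri, content, comparator)  # block 425
        self.log.append((request_time, client, uri, key))     # block 415 or 435
        return key
```

With a threshold of 10, for instance, two pages served from the same URI that differ only in the greeting name would share one Document Key, while a substantially different page from that URI would receive a new key.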






Although the present invention has been described with 
respect to a specific preferred embodiment thereof, various 
changes and modifications may be suggested to one skilled 
in the art and it is intended that the present invention 
encompass such changes and modifications as fall within the 
scope of the appended claims. 
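
As one illustration of such a modification, the token-based Document Comparator mentioned in the detailed description might be sketched as follows. This is a hedged sketch, not the patent's implementation: the function names, the choice of a truncated SHA-1 digest as the "binary token", the per-word granularity, and the 0.8 default match fraction are all assumptions made for the example.

```python
import hashlib


def token_comparator(text: str) -> frozenset:
    """Map each word to a short binary token (assumption: 8 bytes of a SHA-1 digest)."""
    return frozenset(
        hashlib.sha1(word.lower().encode("utf-8")).digest()[:8]
        for word in text.split()
    )


def substantially_similar(a: frozenset, b: frozenset, min_match: float = 0.8) -> bool:
    """True when at least min_match of the tokens in each comparator appear in the other."""
    if not a or not b:
        return a == b  # two empty documents match; empty versus non-empty does not
    shared = len(a & b)
    return shared / len(a) >= min_match and shared / len(b) >= min_match
```

Because the comparator is a set, repeated words count once. An embodiment that weights tokens by frequency, or that tokenizes by paragraph or section, would need a different representation.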

What we claim is: 

1. A method of collecting information about document 
retrievals over the World-Wide Web, comprising the steps 
of: 

receiving a requesting user identity, requested Universal 
Resource Identifier (URI), and a content of a retrieved 
document; 

selecting a Candidate Document from a Retrieved Docu- 
ment Database, said Candidate Document associated 
with a Candidate Document Key; 

comparing said retrieved document to said Candidate 
Document to determine a sufficiency of said Candidate 
Document; 

associating said retrieved document with a newly gener- 
ated Retrieved Document Key if said Candidate Docu- 
ment is not deemed to be sufficient; 

adding said retrieved document to said Received Docu- 
ment Database; and 

adding a Log File Entry including said requesting user 
identity, said requested URI, and said Retrieved Docu- 
ment Key. 

2. The method of claim 1, wherein each of a plurality of 
documents in said Retrieved Document Database is associ- 
ated with a Document Comparator and wherein a first 
Document Comparator may be compared to a second Docu- 
ment Comparator using a Document Comparator Function. 

3. The method of claim 2, wherein said step of comparing 
to determine a sufficiency of said Candidate Document 
further comprises the steps of: 

computing said first Document Comparator for said
retrieved document;

retrieving said second Document Comparator for said
Candidate Document;

computing with said Document Comparator Function a
numeric measure of a difference between said first
Document Comparator and said second Document
Comparator; and

comparing said numeric measure against a predefined
Document Difference Threshold.

4. The method of claim 2, wherein each said Document 
Comparator comprises content of said each of a plurality of 
documents associated therewith. 

5. The method of claim 4, wherein a URI for said 
Candidate Document is equal to a URI for said retrieved 
document. 

6. The method of claim 2, wherein each said Document 
Comparator is computed by associating predefined portions 
of said each of a plurality of documents to a binary token. 

7. The method of claim 2, wherein each said Document 
Comparator comprises a list of significant words or phrases 
in said each of a plurality of documents. 

8. The method of claim 2, wherein each said Document 
Comparator comprises a Comparator for each of a plurality 
of predefined sections of said each of a plurality of docu- 
ments. 

9. The method of claim 2, wherein said step of selecting 
a Candidate Document comprises selecting from a Docu- 
ment Comparator Database. 

10. A system for collecting information about document
retrievals over the World-Wide Web, comprising: 

means for receiving a requesting user identity, requested 
Universal Resource Identifier (URI), and a content of a 
retrieved document; 

means for selecting a Candidate Document from a 
Retrieved Document Database, said Candidate Docu- 
ment associated with a Candidate Document Key; 

means for comparing said retrieved document to said 
Candidate Document to determine a sufficiency of said 
Candidate Document; 

means for associating said retrieved document with a 
newly generated Retrieved Document Key if said Can- 
didate Document is not deemed to be sufficient; 

means for adding said retrieved document to said 
Received Document Database; and 

means for adding a Log File Entry including said request- 
ing user identity, said requested URI, and said 
Retrieved Document Key. 

11. The system of claim 10, wherein each of a plurality of 
documents in said Retrieved Document Database is associ- 
ated with a Document Comparator and wherein a first 
Document Comparator may be compared to a second Docu- 
ment Comparator using a Document Comparator Function. 

12. The system of claim 11, wherein said means for 
comparing to determine a sufficiency of said Candidate 
Document further comprises: 

means for computing said first Document Comparator for 
said retrieved document; 

means for retrieving said second Document Comparator 
for said Candidate Document; 

means for computing with said Document Comparator 
Function a numeric measure of a difference between 
said first Document Comparator and said second Docu- 
ment Comparator; and 

means for comparing said numeric measure against a 
predefined Document Difference Threshold. 

13. The system of claim 11, wherein each said Document 
Comparator comprises content of said each of a plurality of 
documents associated therewith. 

14. The system of claim 13, wherein a URI for said 
Candidate Document is equal to a URI for said retrieved 
document. 

15. The system of claim 11, wherein each said Document 
Comparator is computed by associating predefined portions 
of said each of a plurality of documents to a binary token. 

16. The system of claim 11, wherein each said Document 
Comparator comprises a list of significant words or phrases 
in said each of a plurality of documents. 

17. The system of claim 11, wherein each said Document 
Comparator comprises a Comparator for each of a plurality 
of predefined sections of said each of a plurality of docu- 
ments. 

18. The system of claim 11, wherein said means for 
selecting a Candidate Document comprises selecting from a 
Document Comparator Database. 

19. A computer program product recorded on computer 
readable medium for collecting information about document 
retrievals over the World-Wide Web, comprising: 

computer readable means for receiving a requesting user 
identity, requested Universal Resource Identifier (URI), 
and a content of a retrieved document; 

computer readable means for selecting a Candidate Document
from a Retrieved Document Database, said Candidate
Document associated with a Candidate Document Key;

computer readable means for comparing said retrieved
document to said Candidate Document to determine a
sufficiency of said Candidate Document;

computer readable means for associating said retrieved
document with a newly generated Retrieved Document
Key if said Candidate Document is not deemed to be
sufficient;

computer readable means for adding said retrieved document
to said Received Document Database; and

computer readable means for adding a Log File Entry
including said requesting user identity, said requested
URI, and said Retrieved Document Key.

20. The program product of claim 19, wherein each of a
plurality of documents in said Retrieved Document Database
is associated with a Document Comparator and wherein
a first Document Comparator may be compared to a second
Document Comparator according to a predefined distance
metric.

21. The program product of claim 20, wherein said
computer readable means for comparing to determine a
sufficiency of said Candidate Document further comprises:

computer readable means for computing said first Document
Comparator for said retrieved document;

computer readable means for retrieving said second
Document Comparator for said Candidate Document;

computer readable means for computing with said Document
Comparator Function a numeric measure of a
difference between said first Document Comparator
and said second Document Comparator; and

computer readable means for comparing said numeric
measure against a predefined Document Difference
Threshold.

22. The program product of claim 20, wherein each said
Document Comparator comprises content of said each of a
plurality of documents associated therewith.

23. The program product of claim 22, wherein a URI for
said Candidate Document is equal to a URI for said retrieved
document.

24. The program product of claim 20, wherein each said
Document Comparator is computed by associating predefined
portions of said each of a plurality of documents to
a binary token.

25. The program product of claim 20, wherein each said
Document Comparator comprises a list of significant words
or phrases in said each of a plurality of documents.

26. The program product of claim 20, wherein each said
Document Comparator comprises a Comparator for each of
a plurality of predefined sections of said each of a plurality
of documents.

27. The program product of claim 20, wherein said
computer readable means for selecting a Candidate Document
comprises selecting from a Document Comparator
Database.

* * * * *